preference level
GFRIEND: Generative Few-shot Reward Inference through EfficieNt DPO
Zhao, Yiyang, Bai, Huiyu, Zhao, Xuejiao
The ability to train high-performing reward models with few-shot data is critical for enhancing the efficiency and scalability of Reinforcement Learning from Human Feedback (RLHF). We propose a data augmentation and expansion framework that enables generative reward models trained on small datasets to achieve comparable performance to those trained on large-scale datasets. Traditional methods to train a generative reward model, such as Direct Preference Optimization (DPO), are constrained by inefficiencies in sample pairing and limited data diversity. This work introduces preference refinement, which employs Chain-of-Thought (CoT) sampling to uncover diverse and high-quality preference relationships. It also incorporates a perplexity-based scoring mechanism to assign nuanced preference levels and utilizes Multi-level Direct Preference Optimization (M-DPO) to enable the model to capture finer-grained preference differences between samples. Experimental results demonstrate that the proposed method significantly enhances data efficiency and model performance, enabling reward models trained in a few-shot setting to achieve results on par with those trained on large-scale datasets. This study underscores the potential of data-efficient strategies in advancing reward model optimization, offering a robust solution for low-resource RLHF applications.
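As a point of reference for the two ingredients named in the abstract, the following minimal sketch computes a sequence perplexity from per-token log-probabilities and the standard pairwise DPO loss. The function names and the `beta` default are illustrative assumptions. How GFRIEND maps perplexity scores to preference levels, and how M-DPO extends the pairwise loss, is not specified in this abstract, so neither is modeled here.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a sequence given natural-log probabilities of its tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard pairwise DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy and under the
    frozen reference model. beta=0.1 is an assumed default, not a value
    taken from the paper.
    """
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Four tokens, each with probability 0.5
ppl = perplexity([math.log(0.5)] * 4)
# Policy identical to the reference: zero margin
loss_at_reference = dpo_loss(-1.0, -2.0, -1.0, -2.0)
```

The loss falls below its value at the reference as the policy shifts probability toward the chosen response relative to the rejected one.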
Unravelling multi-agent ranked delegations
Colley, Rachael, Grandi, Umberto, Novaro, Arianna
We introduce a voting model with multi-agent ranked delegations. This model generalises liquid democracy in two aspects: first, an agent's delegation can use the votes of multiple other agents to determine their own -- for instance, an agent's vote may correspond to the majority outcome of the votes of a trusted group of agents; second, agents can submit a ranking over multiple delegations, so that a backup delegation can be used when their preferred delegations are involved in cycles. The main focus of this paper is the study of unravelling procedures that transform the delegation ballots received from the agents into a profile of direct votes, from which a winning alternative can then be determined by using a standard voting rule. We propose and study six such unravelling procedures, two based on optimisation and four using a greedy approach. We study both algorithmic and axiomatic properties, as well as related computational complexity problems of our unravelling procedures for different restrictions on the types of ballots that the agents can submit.
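To make the idea of unravelling concrete, here is a simplified greedy sketch. It is not one of the six procedures studied in the paper. It assumes each delegation names a single agent (no group delegations such as majority over several agents), and the ballot encoding and the function name `unravel` are hypothetical.

```python
def unravel(ballots):
    """Turn ranked delegation ballots into direct votes (simplified sketch).

    ballots: dict mapping agent -> ranked list of options, where an option
    is ("vote", value) or ("delegate", other_agent).
    Returns a dict mapping agent -> direct vote.
    """
    resolved = {}
    level = 0  # deepest rank index currently allowed
    max_len = max(len(options) for options in ballots.values())
    while len(resolved) < len(ballots):
        progress = False
        for agent, options in ballots.items():
            if agent in resolved:
                continue
            # Use the most preferred option that can already be evaluated.
            for kind, target in options[: level + 1]:
                if kind == "vote":
                    resolved[agent] = target
                elif target in resolved:
                    resolved[agent] = resolved[target]
                else:
                    continue
                progress = True
                break
        if progress:
            level = 0  # go back to preferring top-ranked options
        else:
            level += 1  # a cycle blocks everyone: allow one more backup rank
            if level >= max_len:
                raise ValueError("ballots cannot be unravelled")
    return resolved

ballots = {
    "a": [("delegate", "b"), ("vote", "yes")],
    "b": [("delegate", "a"), ("vote", "no")],
    "c": [("delegate", "a")],
}
direct_votes = unravel(ballots)
```

In this example agents a and b form a delegation cycle at rank one, so a backup option is needed to break it. The result depends on the order in which agents are visited, which is one reason to study several unravelling procedures and their axiomatic properties.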
Decision Making Over Combinatorially-Structured Domains
Martin, Andrea (Tulane University) | Venable, K. Brent (Tulane University)
We consider a scenario where a user must make a set of correlated decisions and we propose a computational modeling of the deliberation process. We assume the user compactly expresses her preferences via soft constraints. We consider a sequential procedure that uses Decision Field Theory (DFT) to model the decision making on each variable. We test this procedure on randomly generated tree-shaped Fuzzy Constraint Satisfaction Problems. Our preliminary results showed that the time increases almost in the number of nodes. This is promising in terms of modeling decisions over exponentially large domains. In the future, we plan to compare our results with a non-sequential approach and with behavioral data to assess our approach both in terms of modeling human decision making over complex domains, and adopting DFT as a means of incorporating a form of uncertainty into the soft constraint formalism.
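Decision Field Theory models deliberation as preference states that accumulate attention-weighted comparisons between options over time. The sketch below simulates this for a single variable, under stated simplifications: the feedback matrix is reduced to a scalar decay (no lateral inhibition), and deliberation stops after a fixed number of steps rather than at a threshold. The function name `dft_choose` and all parameter values are hypothetical and do not come from the paper.

```python
import random

def dft_choose(M, attention, steps=200, decay=0.95, seed=0):
    """Simulate a simplified DFT deliberation over one variable.

    M[i][j]: subjective value of option i on attribute j (needs >= 2 options).
    attention[j]: probability of attending to attribute j at each step.
    Returns (index of chosen option, final preference states).
    """
    rng = random.Random(seed)
    n = len(M)
    P = [0.0] * n  # preference state per option
    for _ in range(steps):
        # Attention switches stochastically between attributes.
        j = rng.choices(range(len(attention)), weights=attention)[0]
        column = [M[i][j] for i in range(n)]
        total = sum(column)
        # Valence: own value minus the average value of the competitors.
        valence = [column[i] - (total - column[i]) / (n - 1) for i in range(n)]
        P = [decay * P[i] + valence[i] for i in range(n)]
    return max(range(n), key=lambda i: P[i]), P

# Option 0 is better on both attributes, so it should win.
choice, states = dft_choose([[0.9, 0.8], [0.2, 0.1]], attention=[0.5, 0.5])
```

A sequential procedure in the spirit of the abstract would run one such deliberation per variable, with the values for each variable derived from the soft constraints given the choices already made.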
Uncertainty in Soft Temporal Constraint Problems: A General Framework and Controllability Algorithms for the Fuzzy Case
Rossi, F., Venable, K. B., Yorke-Smith, N.
In real-life temporal scenarios, uncertainty and preferences are often essential and coexisting aspects. We present a formalism where quantitative temporal constraints with both preferences and uncertainty can be defined. We show how three classical notions of controllability (that is, strong, weak, and dynamic), which have been developed for uncertain temporal problems, can be generalized to handle preferences as well. After defining this general framework, we focus on problems where preferences follow the fuzzy approach, and with properties that assure tractability. For such problems, we propose algorithms to check the presence of the controllability properties. In particular, we show that in such a setting dealing simultaneously with preferences and uncertainty does not increase the complexity of controllability testing. We also develop a dynamic execution algorithm, of polynomial complexity, that produces temporal plans under uncertainty that are optimal with respect to fuzzy preferences.
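The fuzzy preference part of such problems can be illustrated without uncertainty. In the fuzzy approach a solution is as good as its least preferred constraint, and with semi-convex preference functions the durations preferred at level alpha or above form an interval, so cutting every constraint at alpha yields an ordinary simple temporal problem. The sketch below finds the best achievable level this way. It does not handle uncontrollable events or any of the three controllability notions, and the discrete encoding and function names are hypothetical.

```python
def consistent_at(n, constraints, alpha):
    """Check whether all constraints can be satisfied at preference >= alpha.

    constraints: dict mapping (i, j) -> {duration: preference} for X_j - X_i.
    Assumes semi-convex preferences, so the durations with preference
    >= alpha form a contiguous interval [lo, hi].
    """
    INF = float("inf")
    d = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for (i, j), pref in constraints.items():
        allowed = [dur for dur, p in pref.items() if p >= alpha]
        if not allowed:
            return False
        lo, hi = min(allowed), max(allowed)
        d[i][j] = min(d[i][j], hi)    # X_j - X_i <= hi
        d[j][i] = min(d[j][i], -lo)   # X_i - X_j <= -lo
    # Floyd-Warshall: a negative cycle means the cut problem is inconsistent.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return all(d[i][i] >= 0 for i in range(n))

def best_level(n, constraints):
    """Highest preference level at which the problem is consistent, or None."""
    levels = sorted({p for pref in constraints.values() for p in pref.values()},
                    reverse=True)
    for alpha in levels:
        if consistent_at(n, constraints, alpha):
            return alpha
    return None

constraints = {
    (0, 1): {1: 0.5, 2: 1.0, 3: 0.5},
    (1, 2): {1: 0.5, 2: 1.0, 3: 0.5},
    (0, 2): {3: 1.0, 4: 0.7},
}
optimum = best_level(3, constraints)
```

In this example the three constraints cannot all be satisfied at level 1.0, because two steps of duration 2 cannot sum to 3, so the best achievable level is 0.7.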